The task of estimating the maximum number of concurrent speakers from single-channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance, or auditory scene classification. Building upon powerful machine learning methodology, we develop a Deep Neural Network (DNN) that estimates a speaker count. While DNNs efficiently map input representations to output targets, it remains unclear how best to handle the network output to infer integer source count estimates, as a discrete count estimate can be tackled either as a regression or as a classification problem. In this paper, we investigate this important design decision and also address complementary parameter choices such as the input representation. We evaluate a state-of-the-art DNN audio model based on a Bi-directional Long Short-Term Memory (BLSTM) network architecture for speaker count estimation. Through experimental evaluations we identify the best overall strategy for the task and show results for five-second speech segments in mixtures of up to ten speakers.
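The regression-versus-classification design decision mentioned above can be illustrated by how each output head is decoded into an integer count. The sketch below is a hypothetical illustration (function names and the rounding/clipping scheme are assumptions, not the paper's implementation): a regression head emits one real value that must be rounded and clipped, while a classification head emits a distribution over the possible counts and is decoded with an argmax.

```python
import numpy as np

# Illustrative sketch of decoding a network output into an integer
# speaker count; names and details are assumptions for this example.
MAX_SPEAKERS = 10

def count_from_regression(output_scalar: float) -> int:
    # Regression: the network emits a single real value; round it and
    # clip to the valid integer range [0, MAX_SPEAKERS].
    return int(np.clip(np.round(output_scalar), 0, MAX_SPEAKERS))

def count_from_classification(output_probs: np.ndarray) -> int:
    # Classification: the network emits a distribution over the
    # MAX_SPEAKERS + 1 possible count classes; take the argmax.
    return int(np.argmax(output_probs))

# Usage: a regression head predicting 3.4 speakers decodes to 3 ...
print(count_from_regression(3.4))

# ... as does a classification head whose distribution peaks at class 3.
probs = np.zeros(MAX_SPEAKERS + 1)
probs[3] = 0.9
probs[4] = 0.1
print(count_from_classification(probs))
```

A practical consequence of this choice is the loss function: the regression head is typically trained with a mean-squared-error objective, whereas the classification head uses cross-entropy over the count classes.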